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Abstract 


This paper extends single-level missing data methods to efficient estimation of a 
Q-level nested hierarchical general linear model given ignorable missing data with a 
general missing pattern at any of the Q levels. The key idea is to reexpress a desired 
hierarchical model as the joint distribution of all variables including the outcome that 
are subject to missingness, conditional on all of the covariates that are completely ob- 
served; and to estimate the joint model under normal theory. The unconstrained joint 
model, however, identifies extraneous parameters that are not of interest in subsequent 
analysis of the hierarchical model, and that rapidly multiply as the number of levels, 
the number of variables subject to missingness, and the number of random coefficients 
grow. Therefore, the joint model may be extremely high dimensional and difficult 
to estimate well unless constraints are imposed to avoid the proliferation of extrane- 
ous covariance components at each level. Furthermore, the over-identified hierarchical 
model may produce considerably biased inferences. The challenge is to represent the 
constraints within the framework of the @-level model in a way that is uniform without 
regard to Q; in a way that facilitates efficient computation for any number of Q lev- 
els; and also in a way that produces unbiased and efficient analysis of the hierarchical 
model. Our approach yields @-step recursive estimation and imputation procedures 
whose qth step computation involves only level-q data given higher-level computation 
components. We illustrate the approach with a study of the growth in body mass index 
analyzing a national sample of elementary school children. 


KEY WORDS: Child Health; Hierarchical General Linear Model; Ignorable Miss- 
ing Data; Maximum Likelihood; Multiple Imputation. 


Acknowledgements: This research was supported by the Institute of Education Sci- 
ences, U.S. Department of Education, through Grants R305D090022 to NORC and 
R305D130033 to VCU. The opinions expressed are those of the authors and do not 
represent views of the Institute or the U.S. Department of Education. The first author 
was also supported by grant 1U01HL101064 from the National Institutes of Health. 


1 Introduction 


A seminal contribution to statistical methodology is the development of efficient methods 
for handling missing data within the framework of a general linear model (GLM, Rubin, 
1976, 1987, Dempster, Laird, and Rubin, 1977, Meng, 1994, Schafer, 1997, 2003, Little and 
Rubin, 2002). These methods provide efficient estimation of the GLM given incomplete 
data. In particular, model-based multiple imputation now provides state-of-the-art methods 
for handling missing data (Rubin, 1987). These approaches are founded on a comparatively 
mild assumption in many applications that missing data are ignorable (Rubin, 1976, Little 
and Rubin, 2002). 

This paper extends the methodology to an arbitrary Q-level hierarchical GLM where 
lower-level units are nested within higher-level units (Raudenbush and Bryk, 2002, Gold- 
stein, 2003). Many multilevel observational studies and controlled experiments produce 
missing data. In cluster-randomized experiments, the dominant design involves the random 
assignment of whole schools, hospitals, or communities, rather than students, patients, or 
adults to treatments (Bingenheimer and Raudenbush, 2004). Multilevel analysis is pervasive 
in health, education, and social science studies (Datar and Sturm, 2004, Gable, Chang, and 
Krull, 2007, Danner, 2008, Shin and Raudenbush, 2010). Surveys involve multi-stage sam- 
pling designs (Tourangeau, Nord, Lé, Sorongon, and Najarian, 2009). A ubiquitous problem 
is that explanatory as well as outcome variables may be subject to missingness at any of the 
levels. 

In longitudinal studies, hierarchical data subject to missingness may be estimated by 
maximum likelihood (ML) in a structural equation model (SEM) where latent means include 
missing data (Allison, 1987, Muthén, Kaplan, and Hollis, 1987, Muthén, 1993, Arbuckle, 
1996, Enders and Peugh, 2004). SEM software such as Mplus (Muthén and Muthén, 2010), 
Amos (Arbuckle, 2003) and EQS (Bentler, 2007) performs ML estimation of such models. 
When these models are formulated by multi-group analysis, the number of groups is the 
number of missing patterns (Allison, 1987, Muthén et al., 1987, Muthén, 1993). 

Recent advances have extended the single-level methods to multilevel ignorable missing 
data. Liu, Taylor, and Belin (2000) considered Bayes inference to longitudinal designs having 
a fixed within-subject design with repeated measures at level 1 nested within persons at level 
2 where the data are missing at both levels. Schafer and Yucel (2002) developed Bayes and 
ML inference for a broader class of two-level designs in which the level-1 design could vary 
across level-2 units with level-1 data subject to missingness. Goldstein and Browne (2002, 
2005) took a Bayesian approach to a two-level factor analysis where missing outcomes were 
imputed by a Gibbs sampling step. Shin and Raudenbush (2007, 2010) extended these 
methods to a two-level model where the outcome and covariates may have missing values at 
both levels. Shin and Raudenbush (2011) and Shin (2012) illustrated an efficient ML method 
to estimate a three-level model with incomplete data. To estimate a three-level hierarchical 
linear model with incomplete data, Yucel (2008) modified a single-level imputation method 
(Schafer, 1997, 1999) and a two-level imputation method (Schafer and Yucel, 2002) to carry 
out the Gibbs Sampler to sequentially impute cluster-level missing values, intermediate- 
level missing values given the multiply imputed cluster-level data and then, the lowest-level 
missing values given the multiply imputed data at higher levels. These advances guide us 
with continuous outcomes. Goldstein, Carpenter, Kenward, and Levin (2009) and Goldstein 


and Kounali (2009) used a Markov Chain Monte Carlo method to impute a mixture of 
continuous and discrete outcomes subject to missingness in a two-level model. 

Shin and Raudenbush (2007) illustrated two ways to handle two-level missing data: direct 
ML estimation (MLE on Y.,;) and a two-stage procedure of multiple imputation followed by 
the second stage analysis of the multiply imputed data (MLE on Y™). This paper general- 
izes the two methods to an arbitrary number of Q levels and an arbitrary number of outcomes 
defined at any level. A key emphasis in this paper is the difference in logic and assumptions 
between the two methods. Using MLE on Y,»5, one first writes down a desired hierarchical 
model, then reparameterizes the model in terms of the joint distribution of outcome and co- 
variates subject to missingness given the completely observed covariates. Great care must be 
taken so that the transformation is one-to-one in order to insure unbiased estimation (Shin 
and Raudenbush, 2007, 2010). This task generally requires the imposition of constraints on 
regression surfaces at each level to avoid the proliferation of covariance components at higher 
levels of the joint distribution. One challenge for this paper is to formulate these constraints 
within the framework of the @-level model. We show that the unconstrained joint model 
identifies contextual effects (Raudenbush and Bryk, 2002) and interaction effects that are 
typically extraneous for the desired model. In contrast, MLE on Y™ generally implies 
that the imputation model should be unconstrained, allowing the data analyst to impose 
the desired constraints at the second stage when using conventional software to analyze 
the imputed data. A challenge with MLE on Y™ is that the need to avoid constraints 
at stage one may lead to the formulation of extremely high-dimensional imputation models 
that may be difficult to estimate well. The two methods have characteristic advantages and 
disadvantages. MLE on Y,», imposes more assumptions than does MLE on Y™, because 
MLE on Ys imposes distributional assumptions on all variables subject to missingness 
while for MLE on Y™, such assumptions do not affect the observed data. Given the pluses 
and minuses, this paper considers a hybrid approach that combines the two methods. 

Our aim is to formulate a Q-level model that unifies single- and multi-level models into 
a single expression, facilitating extension of existing missing data methods to an arbitrary 
number of levels of a linear model with efficient estimation and computation. The model 
has two representations: a hierarchical linear model for a response variable conditional on 
covariates, and a marginal model that represents the joint distribution of all variables - 
the response and covariates - that are subject to missingness conditional on the completely 
observed covariates. It is essential to clarify the relationship between these two models; 
in particular, the conditional model is always equivalent to the joint model after imposing 
certain constraints on the more general joint model. To clarify the needed constraints in 
a general Q-level setting, we find it revealing to re-parameterize both forms of the model 
such that all variables subject to missingness are decomposed orthogonally by level. In 
this model, random components are correlated within but uncorrelated between levels. The 
required constraints then fall out naturally for any number of levels. 

The next section defines the joint model and decomposes all variables subject to missing- 
ness into orthogonal random components. Section 3 considers the problem of estimation and 
multiple imputation. The orthogonal decomposition is helpful. The key insight is that, if we 
stack the joint model by level, we can write down estimation formulas at level g using level-q 
data only given higher-level computation components that are uniform for all g even if we 
ignore all other-level data. This recursive representation yields efficient computations using 


conventional ML methods such as the EM algorithm (Dempster, Laird, and Rubin, 1977). 
Thus, orthogonal decomposition by level transforms a seemingly intractable computational 
problem into a sequence of familiar, solvable problems as described in Section 3. Section 4 
introduces the conditional hierarchical linear model. We show that the joint model represents 
more parameters than desired in the hierarchical model, and describe how to constrain the 
joint model for identification of the desired hierarchical model. Incorporating the constraints 
is essential to avoid bias for MLE on Y,. For MLE on Y™, the constraints do not reduce 
bias but may nonetheless be practically necessary for computation involving high dimen- 
sional models. Analyzing data from a large, nationally representative longitudinal sample of 
children, we illustrate these methods to study predictors of the growth of body mass index 
(BMI) between ages 5 to 15 in Section 5. This is a three-level problem, with up to seven 
repeated measures on children who are sampled within their elementary schools. Section 6 
concludes with a discussion of limitations and next steps in the Q-level research agenda. 


2 Joint Model 


All of the models described in this article are subsets of a multilevel p-variate model 
Y=p+Zb~N(u,V), b~ N(0,T) (1) 


where every element of Y is subject to missingness, 4 may be a linear function of completely 
observed covariates, Z is a matrix of completely observed covariates having random effects 
band V = ZIIZ’. For simplicity of exposition, we shall assume pz = 0 in this section. We 
partition Y = [Y/ YJ --- Yo |” such that Y; is a vector of p, variables at level ¢ in a hierarchy 
of Q levels for p = pean Pq. In the case of Q = 2 where occasions are nested within children, 
for example, elements of Y; such as body mass index and daily TV viewing hours vary across 
occasions at the lower level 1; and elements of Y2 such as years of highest parent education 
and birth weight vary among children at the higher level 2. 
The variance covariance matrix V may be structured by the fact that Y, varies at level 
q or higher. In the case of Q = 2, for example, the body mass index in Y, varies within 
as well as between children while the birth weight in Yj varies between children but not 
within a child. Thus, we partition Z = Ce, Zq = diag{Z, Z2,-++,ZQ}, a diagonal matrix 
having diagonal submatrices (Z,, Z2,---,Zq), and b = [bf 6} ---b6]", and decompose Y, 
orthogonally by level as 
Yq = Zqbg~ N (0, Vag) ’ bg ~ N (0, qq) (2) 
f0R-Zy = |Zag Za stig? Beals Oy = les, erexiie --eGa)? and Vog = ZqlqqZ7 where Zyq is a 
matrix of known covariates having level-r unit-specific random effects €,¢ ~ N(0,77,) for 
subscript and superscript r denoting the level of variation. The orthogonal random effects 
Erq are independent between levels so that Ij, = @&,,7",, but correlated within levels by 


r=q ° qq? 
0 
COU(Erq; Ers) = Th, 80 that cou(bg, bs) = Igs Oe ao for: q°< 8: ‘Then, Il =. |T,,|, 
r=s "qs 
a matrix with (7,7) submatrices H,; for 7,7 = 1,2,---,@. We may now express Equation 


(2) as: Y= eae Tage having Cou Vi.Ve)- Vig a = y2@__ vu". for g < s where 


r=s “qs 
Ugs = Locate ess and decompose V = [V;,] for i,7 = 1,2,---,Q by the level of variation as 
A viz, 0 0 Ul a5 0 vt “i ops 9 
% 0 0-:-- 0 v2, vs, +++ O Us Veg $25" 
ot eae (cs (ae ee ie ean oo) 
Yo OG: iO? axe +6 (k 20 tated vB) oe ae veg 


To reveal the orthogonal decomposition explicitly, we show all random components of the 
joint model (1) in Table 1. Row label Y, indicates a vector of variables that are decomposed 


Table 1: All random components of Y in Equation (1). 
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into the random components by = [e/, Ey sage -€O¢|' listed in the row. The random compo- 
nents (bi, b2,---,bg) enable us to write down estimation formulas at level g that are uniform 
for all g and that use level-q data only given higher-level computation components. This 
representation facilitates construction of efficient computation formulas as will be explained 
in Section 3. The column label shows the level at which the random components in the 
column vary. Table 1 lists random effects that are correlated within, but uncorrelated across 
columns or levels. Column gq in Table 1 lists level-q unit-specific random effects €, ~ N(0, 7) 
for 


Eql Ti Tp vee Tq 
se) Oe mele ee (a) 

€qq 741 749 oe Gt 
Thus, all random effects b of the joint model (1) may also be expressed as (€1, €2,---,€Q), 
displayed vertically in Table 1. Their parameters are 7 = (71, 72,-+-,7Q). This expression 


is useful for deriving estimators as will be described in the next Section. 


3 Estimation of the Joint Model 


The orthogonal decomposition by level shown in Table 1 enables us to write down the joint 
model (1) as a familiar mixed linear model (2) for any level g. Next, we shall exploit the 
fact that if we stack these level-specific models such that there is an equation at level g and 


a second equation stacked at all levels higher than g, we can write down familiar estimation 
formulas that use level-¢ data only given higher-level computation components even when we 
completely ignore all data at other levels. Moreover, the estimation formulas remain uniform 
for all g to produce efficient computation formulas as will be explained below. Therefore, 
the orthogonal decomposition of the joint model (1) enables us to obtain general, recursive, 
and familiar estimation formulas for the Q-level problem. 


3.1 Structure of Joint Model for All Data at or above Level q 
Based on the model (2) at level g, we stack Y? = [Y" Y,0,---YQ]", Z7 = @%, Z, and 


bt = [bt b7,,-+-b6]" to express the joint model (1) at level q or above as 
Y% = 7901 ~ N(0O,V%), 5b? ~ N(0, HI”) (5) 
Y, Zi). 10 b II Tys@+) 
fon Ye = yor F Zt = 0 zat | = pat | in mata T1 a+) (441) | and 
V. yaatl) 


_ i 
VG = FYING = Par V(atb(qt1 


) | where TI#9t+)) = [Ta (o-a) Tg(g+2) **° Tya| and 


Very ee, Zee Ze cs [Vetas) Va(qt2) °° Vaal for the transpose Zt)? of Za! 
Note that we express Y?, II% and V% uniformly for all q to contain Y¢*!, T¢+Y)@+) and 
Vt)Gt) respectively. For Q = 3, for example, Y? = Z°b? ~ N(0, V**) for Y? = Y3, Z3 = 
23 = 233, b Sh ]e. I? = Ne = een VP Se = 23113323 ; YF he NOV) 108 


aa |. 2] F 0 |=] | me | oe me] aa ve — | V2 va | 


where Zo = [229 232] , by = S I Tloo = @?_» T59, TI23 = II43 = E | and y23 = V3 = 
23 


Follo3 72 and Yr Zh! we N (OV). for Y* = iy | Zi = i I bl = Fa 


Thay Ty? 
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| and yu = | where Z\ = [Z11 221 231], by = €91 |, 
€31 
0 0 0 
i= eat I = [ls Ms) and VY = [Vis Via) for Tie = |r, 0:4 Tig] 0: |, 
0 mp ™}3 
Vig = ZMy2ZF and Viz = Z,11)3Z2. 

Equation (5) for g = 1 expresses the joint model (1) where Z! may include covariates 
having random effects at all levels. Often, the model (5) itself is of interest (Shin and 
Raudenbush, 2011, Shin, 2012). For a positive integer n, let [, denote an n by n identity 
matrix. In this paper, we focus on estimation of the model (5) where Z, = Ip, for all q as 
many applications do. In what follows, we use the Kronecker product A ® B that multiplies 
matrix B to each scalar element of matrix A (Magnus and Neudecker, 1988). In particular, 


I, ® B = diag{B,---, B}. 


3.2 Estimation 


In deriving estimators, it is essential to aggregate the stacked-up joint model (5) at the 
ae level-Q cluster m. We refer to subscript m as a unit at the highest level Q for 
= 1,2,---,Ng; Nom as the number of units at level g nested within the cluster m; and N, = 
a Noes Hereatter., Let: Vor — Youn Yasin Manian ON Caan = [esta Spams genie 
aggregate all Y, and e€,, in cluster m for s < g. The level-q unit-specific random effects in 
column q of Table 1 are aggregated to form the aggregated Equations (4) for cluster m 


q q q 
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Table 1 may also be aggregated to reveal all random components for cluster m. The aggre- 
gated table is of the same form as Table 1 with row label Y,,, replacing Y,, the same column 
label and the vector €,,, instead of €, in column g. The random components in row Yom, of 


the table are bgm = [€fam eG fey --€5am|’ to form the aggregated model (2) 


Yom — Lg Ogi i N(O, Vagm); bam ae N(O, Iqqm) (6) 


for a conformable matrix of known covariates Zgm = [Zqqgm Z(q+1)gm*** Z@qml, Ugam = 

Or, Nim @ Taq 200 Vogm = Lothar where Zggm = INamxpj- Then, cou(bgm, dsm) = 
0 

ee 3 Nem © Tie 

Vash SL opillgethZ ay = en Sota Lp OT, Lor GS: 

Next, we stack ¥2 = [YonYauiym***Yaml’> 24, = Be, Zim and bf, = [BF , Diane? Dom 

to express the aggregated model (5) uniformly for all q as 


i) ee — for g < s, and Yon = O%_, Zram€ram has cou(Yom, Yom) = 
qd q r=q qd qd qd 


Y2 = 24,4, ~ N(0,Via2), 4, ~ N (0, 112) (7) 


Yu. Fe 0 bak ‘cin Taa+) 
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Vics valet) 
Vio = ZEN Zr = yt “I)q vat) (q+1) | Where Tigett) = [Th y(a1)m Ta(qt2)m °°" Tqcqm| 
and Vr) = Zq Alen Ze = [Vagtiym Vo(qt2)m * 7° Vem] By expressing Zgm = 
[TNamxpq 2—aam] having bgm = ios Cl aoal for Zgqam = [Z(q+1)am*** ZQqm| and €-gam = 


eeaaiume coum so that €gqm ~ N(0, Ign © Taq)? €_gqam ~ N(0,®gqm) and nee) = 


Cov(Egams02) 0 ; 
ae ce | = part) me for Bagm = Dregtit Nim ® Tq, we structure V7 = 

I Fig D le Z— cam BU) ZatyT 

ane allgaGrt Ze, ; + Vtbatt) in a familiar form (Shin and Rau- 


denbush, 2007) that is union for all q and has recursive V,@t)G+) = ZattyTt Vat) Ztyr 
Shin and Raudenbush (2007) illustrated how to efficiently estimate the two-level model 


(7) for Yt = [Y,0, Yo)" by ML via the EM algorithm where Yj, and Y2m are vectors of 
arbitrary length. The key insight of the model (7) is to express the familiar two-level form 
Y¢ = [Yj Y@Y7)" uniformly at each level g and estimate it by the method of Shin and 
Raudenbush (2007) given computation components at level q+ 1, starting from the highest 
level ¢ = Q until we estimate the desired model for Y,! with arbitrary Q levels at ¢g = 1. The 
initial step is to estimate the single-level model (7) for g = Q, a special case of Shin and 
Raudenbush’s two-level model. We formalize the recursive estimation within each iteration 
of the EM algorithm after defining notation for estimation below. A major advantage of this 
approach is that the step-q estimation uses only level-g data given higher-level computation 
components for efficient computation as will be shown in the following section. 


To express the relationship between complete and observed data, let Ogm = a O ai, 
be a matrix of dummy indicators for observed values in Yon, = (v2, Yom’ * : ee so 


that Vanicbs = Orr Ge and Lariobs = OunZam = [See ane for Z comers = > Osis f cere, 
Ca and Raudenbush, 2007). At level Q, Nam = 1 and Ogm = Ogim for all m. Let 


Yooons = O4,Y,1 and Z4,, = O4,Z4, for O%, = O2_, Orm to express the observed model (7) 
Yinobs = Lebar = N(0, Va Vinobs = _ OLVEOr = Ze he Ae ee (8) 
Estimation of 7 = (7, 7,---,7g) is carried out via the EM algorithm. The complete 
data (CD) for cluster m may be viewed as bi, given 7. Let € = (bj, b3,--- ON) for the 
entire sample. If we denote €gin, = €g in column gq of Table 1 for unit 7 mesne within cluster 
m, the CD log likelihood of 7 may be expressed as [(z|e) = ve am Se aig Inf (€qim|7) 


for the density f(€gim|7) of level-q unit-specific €gim ~ N(0, a The CD ML estimators 
are ft, = Dee, Nem €qim€sim/Ng- The E components are from b},|Y,h4,, ~ N(b},, Itt) the 
conventional estimation of which requires inversion of V,1’,, that may be extremely high 
dimensional. The orthogonal decomposition by level of the joint model (1) enables expression 
of 61, |/¥25,, ~ N(b%,, 24) that is uniform for all g and based on level-g data only given 
computation components from higher levels. As will be explained below, this recursive 
expression yields successive Q-step estimation formulas from (b%, 1122) down to (01, Hs) 
for efficient computation of the E step without directly inverting V,/’,.,. 
To estimate fixed effects, let = [ui l2--- Wa] for Wg = XqGq in the model (1) where X, is 
a matrix of completely observed covariates having fixed effects 6,. We replace Yz = Zgbg with 
dy = Ya — Xqbq = Zqbq in the level-q model (2) to express the stacked-up model (5 ) as dt = 
¥9 — X%B9 = Z%4 for d? = [d? dt: dG)", X41 = Oe, Xp and 6% = [BF 6G.) -- BGI’. 
Let dom = Yom — Xqmbq aggregate d, = Y, — X,8, such that the conesponding mididel (7) 


has:dl,= V2 — X90 69 = 200%, tor dt, = [a7 Coad dom) and X42 — OP, Mano Then, 
di ons = O8,d4, and X41, = O4,X4%, so that the observed model (8) is sale = es — 


Xo ee = "Zt ba. The desired parameters are 6 = (7,3) for 8 = 81. For simplicity of 
notations, let V,,4, and V“*)) denote the inverses of V,%%,, and V{f VG") | respectively. 


Given current (9, the Fisher scoring equivalent to the Newton-Raphson update is 


—1 No 


A N@ 
B = Bo a o XV S- Dede Rin: ae (9) 
m=1 m=1 


The following section describes efficient recursive computation of B based on X@, V4 x4 


a mobs " mobs~* mobs 
q —q qd 
and X ibs mobs Ci nébs : 


Equation (7) expresses a single-level GLM when Z], is an identity matrix and 0}, has 
Tf! = If"! for all m. The clustering of sample data discussed above gives rise to a Q-level 
GLM. Next, we show that the aggregated joint model (7) enables us to write down efficient 
(-step recursive estimation formulas the gth-step computation of which involves level-qg data 
only and thus is not unduly burdened with respect to the number of Q levels, p variables 
and random effects. 


3.3 Efficient Computation 


The conventional E step based on bL|Y1.,, ~ N(b1,, 111!) requires inversion of V1, which 
may be extremely high dimensional and, thus, take long to compute within each iteration of 
the EM algorithm. On the other hand, the E step based on Equation (8) produces Q-step 


estimation formulas the qth step of which is to estimate b%,|Y2,,,,0 ~ N(b%,, H%) where 


eA. eae = ee 4 ne (10) 
for AL as = De Ve re and Br ows — Ue ps2 robs: Lhe key advantages of the E step 


via Equations (10) are that 67, and II% stay uniform for all g; that computation of b7, and 


1177 uses level-q data only, given At ., Bt), and 6; that the expressions (10) enable efficient 


computation of 6}, and II}} via computation of recursive components A¥%,,,, and BY, 4,; and 
that the E step does not require direct inversion of V,11,.. Estimation of b}, and II1! starts 
at the highest level ¢g = Q with initial components 


Q 
ye lee 


=OnnVin ns Be 


mobs 


= Ob Vins Om (11) 


for Vee = TS = OG eg Obras computes A’, and B! ,. using level-q data only, given 


obs mobs mobs 
1 1 ; : 
Afr, and Bf, . at step q, and finally evaluates Equations (10) given Al,,,, and Bl, at 
q = 1 within each iteration of the EM algorithm. 
To formulate the recursive computation, let 
~ = q =e eal -1 1% 
€_qqm = E(€—gqm|Y moos) x A ym (Z—qamobsY gm Fqmobs a QymE-aam), (12) 
—qil oo) -1 = -17T sil 
mobs ~— qm =D 5 2 aainoba aia” — Guniobs qm) 
—qlil —ql2 
V2 = mobs mobs 
mobs —_ —q2l —q22 
mobs mobs 
_ 7T -1 -1 — qMNan _4 = qtly _ 
where om = 2 = aqmobs¥ m2 —qqmobs a ee Wm 7. Diet TNagim> Quam ~—% var (€—gqm|Ymoba) > 
—pratl) Bat! @a(qt+ DT z ie qtl) _ ga(qtl) qq me f) eel OL 
Daqm oF Been and €—qqm = Fi(€_gqm|Yprobs) a £F. ions for Tagim 2% Ogim™ sO im: 
—qll 


Note that computation of €_gqm and V,,,¢,, requires level-¢ Z—qqmods ANd dgmovs Only, given 


At? BI! and 6. The following result shows that A’, and B’,. depend on level-q data 


mobs? mobs mobs mobs 


only, given higher-level components At. and Bee. See the Appendix for proofs. 


Trubs and Zo Va! mobs Cepend on level-q data Ygmovs; Xqmobs 


Py ath gatt gat Ty Gt) zat] and 6 for allg <Q. 
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Proposition 3.1 77 


mobs 


v-4 


mobs 


and Z_qqmobs only, given Z4 (a4 


ae S 


Proposition 3.1 enables a Q-step recursive computation of bt and Ty the gth step of which 


involves level-g data only without directly inverting V,'1,,. 


Theorem 3.2 bl and Tr! can be computed by a Q-step recursive procedure from level Q 
down to level 1 given 0 where step q involves level-q data only. 


The E step for Q = 3, for example, computes A?,.,, = ee 
OF 733,03m in Equations (11) initially for the inverse 733°, of 73;,,; A?,,,, and B?,,,, in 
Equations (32) and (33) given A?,,,, and B?,.,, at q = 2; A},.,, and Bi..,, in Equations (32) 
and (33) given A?,,,, and A?,.,, to finally yield bt, and ee in Equations (10) at g = 1. 


Om aims and B? = 


mobs mobs 


Fisher scoring on 3 may also be based on recursive computation of A‘? .,,, Ba ons, 
Er be = RON aioe ee Gorobs = DV ets Hoi = ZV ak (13) 

The following result shows that F!.,,, G4,.,, and H#,, depend on level-q data only, given 

1 +1 +1 +1 +1 
ATS os Be oes heer Ge re lee and 0. 

eye T T T 
Proposition 3.3 Dorabs aaa abe XinatsVinobs mobs oe Ce ee ace a ene 
q ne qmobs > X gmobs and Z_qqmobs only, aa Zeb ann Grobs Z mobs Ve. Diss 
+1) yq+1 +1)T +1) +1) 

ene Ty aH) Ginobe) baer Va es ashe “Vee pa and 0 for all q< Q. 


Propositions 3.1 and 3.3 enable 8 to be computed recursively. 


Theorem 3.4 X!7, Voj,.X oop, and X17, Vo sncdnons can be computed by a Q-step recursive 


procedure from level Q down to level 1 given @ where step q involves level-q data only. 


The Fisher score on @ for Q = 3, for example, computes A3,,, = OF ,13a),d3mobs 
Be obs = O85 ™33mO 3m ee > X3mobsT asm 3mobs Ce = Xe wuoha teases X simone and Fe obs = 
Ong tea sap tially Ass Re Fnobs! G? on, and H?,,,, in Equations (32) to (36) 
from level-2 data, given A?,, Boone, Foose, Going and H?,,, at q = 2; Fl.,, and G1... 
in Equations (34) and (35) from level-1 data given A’,o,.; Brobs: Finobs: Gmobs and H7,o5, to 
finally yield 8 in Equation (9) at-g=1, 

The recursive estimation efficiently handles missing data one level at a time. Conse- 
quently, the computation will be efficient given a number of variables subject to missingness 
at higher levels. In that case, the observed joint model (8) yields recursive computation that 
is not excessively burdened with respect to Q, p and the number of random effects. On the 
other hand, given missing data at level 1 only, this approach amounts to the conventional 
EM algorithm. The inverse of the Fisher oe matrix yields var(@). 


6~N (x16 + Zip 7) T11Z17) given the 


mobs? S 


FS 
obs? mobs 


mobs) mobs) 


Multiple imputation is based on Y,"|Y,1 


mobs? mim) “~m-~m~m 
1+pi; __ Ti 
ML 6. Let v,; and vi; be variances of log(m;) and loge where i # j and p;; = vai for di- 
agonal 7;; and off-diagonal 7; in (71, 72,---,7@). Then, N[log(7i), bi] and N (log 1-55") 
aoe 


estimated by ML imply the joint distribution of a vector, ¢,, of distinct log(m:) and log | 
from 7. Let the distributions of ¢, and 6 be N ba, var (b,)| and NIB, var (8)] estimated 
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ML, respectively. To propagate uncertainty in estimation of 6 for proper imputation (Little 
and Rubin, 2002), we randomly generate 6 from NIB, var()] and mq from N ba: var(b,)| for 
all g, and then impute missing data given the @ for each imputation (Shin and Raudenbush, 
2007). 

Thus far, we have focused on estimating the joint model (1) for variables subject to 
missingness given completely observed covariates. However, our goal in this paper is to 
estimate a general Q-level hierarchial GLM for a univariate response conditional on covariates 
where the covariates as well as the response may have ignorable missing data at any of the 
levels. 'To efficiently estimate the GLM, we have to reparameterize it in the form of the joint 
model (1). The next section will introduce the desired GLM, clarify its relationship with the 
joint model, and describe methods to efficiently estimate the GLM via the joint model. 


4 Hierarchical General Linear Model 


The aim of this article is to estimate a Q-level hierarchial GLM that is a special case of the 
joint model (1) in which a univariate response is defined at the lowest level of aggregation. 
We show that the joint model overidentifies the desired model, in general, and describe 
how to constrain the joint model to be a one-to-one transformation of the GLM. Without 
the one-to-one correspondence, the estimated GLM via MLE on Y,», may be substantially 
biased (Shin and Raudenbush, 2007). For simplicity of explication, we first consider the 
desired GLM where all covariates having fixed effects are subject to missingness. This 
consideration is without loss of generality because completely observed covariates having 
fixed effects do not affect the constraints on the joint model. After explaining the needed 
constraints for MLE on Y.y;, we consider a more realistic GLM having both covariates 
subject to missingness and covariates completely observed. 
We write the general Q-level hierarchial GLM 


R=C"y+Dte, e~ N(0,7) (14) 


for C = [AT Yj’---Yg]” having fixed effects y = [yf 73 ---74]’, D = [Df Dz --- DG)" 
having random effects e = [e? es ee eal’, and T = O24 tT, where A and Y, are vectors of 
p; — 1 level-1 and p, level-r covariates having fixed effects 7, and 4,, respectively, D, is a 
vector of pp, covariates having level-r unit-specific random effects e,. ~ N(0,7,) independent 
across levels and pp = een Dpr- Both R and C are subject to missingness while D is known. 
We assume D; = 1 and that D,. carries an intercept as many applications do, although it is 
not required to have one. 

The aim of this article is to efficiently estimate the Q-level hierarchical GLM (14) given 
incomplete data. To do so, we must reparameterize the equation (14) in the form of the 
joint distribution (1) of all variables subject to missingness - including the response and 
covariates at any level given D. We define the first element of Y; as the response R and 
the remaining elements of Y; as covariates A to partition Y; = [R A‘’]* and decompose Y 
into the response and covariates Y = [R C7]? where we decompose R = Shae Dre,p and 
C= lAr yee YA)? =| pe ee Se ree €Og|' orthogonally by level. Then, Equation 


f=1 
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(14) is a special case of the model (1) for 4 = 0 where Y; = [R A‘]? implies partitioning 


T r r r 
Fc Dy; 0 eo €rR opt TRR TRA rw = T Raq (15) 
do 0 L Spl € ’ = 7 rr, > “lq rT, 
pi-l rA AR AA Aq 


and C implies Z,q = Ip, for q > 1. Then, V = COA). Oe | = Ver Varo | 


for 


Q Q Q 
T T T T 
Ver = sD, TrrDr, Vero = SD: TRA baie 2): Tho +> Dat Eo ; (16) 
r=1 r=1 r=2 
Q Q Q 
Tas O ++: O Taa Was 0 TAA TAQ 7 4Q 
0 O --- 0 Tig Se 0 fey. Tey Se ORS 
Voo = eee Se se 
a RePas Q Q Q 
0 O 0 0 0 0 Waa Teo -** Sq 


We-tefer to Y = [0% , Dian yey De ge €Og|' as the form of the joint model (1) 
for efficient estimation of the GLM (14) in this article. Then, the GLM (14) implies 


var(R) = 7? Vecy + D?rD and cov(R, C) = y' Voc. 


When Q = 1, the reparameterization is one-to-one between Equations (1) and (14) and 
no difficulties arise in computation and interpretation as illustrated in Section 4.1. However, 
when Q > 1, we find that the reparameterization required to equate the conditional model 
(14) to the corresponding joint model (1) can be quite challenging. Without imposing 
constraints, the joint model will over-identify the conditional model (14). We can readily 
comprehend this problem in the case of two-level models shown in Sections 4.2 and 4.3. We 
then generalize our approach in the subsequent section. We see that the problem of over- 
identification can become severe as covariates, levels and random coefficients are added to 
the model. 


4.1 Single-Level Model 


Equation (14) for C = A, y = j%1, D = 1 and e = e, expresses the conventional ordinary 
least squares (OLS) regression model as the conditional distribution of R given covariates 
A. Efficient estimation of the model from ignorable missing data (Rubin, 1976, Little and 
Rubin, 2002) is straightforward when we estimate the corresponding joint model (1) 


[Al~*([o] [ve vee ]) a 
for Ver = Trp, Vac = Tha and Voc = 74,4. The OLS model (14) implies Var = yf VeoyitT1 
and Vero = 77 Vcc to yield the one-to-one transformations y7 = VroVoo and 7, = Ver — 


4Vceov. That is, the p; parameters in the OLS model (14) and the (p; — 1)p,/2 variance 
and covariance components in Voc are one-to-one functions of the p;(p; + 1)/2 parameters 
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in the joint model (17). Equivalently, the p; parameters in the OLS model are one-to-one 
functions of (Ver, Vec) without redundantly counting the number of parameters in Voc. 

The general forms of the joint model (1) implied by the equations (16) and the conditional 
model (14) remain intact when we consider the hierarchical linear model with arbitrary Q 
levels. A central concern of interest to this paper is that, when we move beyond the single 
level case for Q = 1, the desired model (14) is not a one-to-one transformation of the joint 
model. To see how this works, we consider two-level data where level-1 units (e.g. students) 
are nested within level-2 units (e.g. schools) before we consider arbitrary Q levels. We shall 
consider the cases of the two-level model with a random intercept and the two-level model 
with random coefficients. 


4.2 Random-Intercept Model 


A comparatively simple two-level hierarchical linear model with a random intercept is of 
form 


R= Aly, + YS 72 +e€i+e€2~ N(A yy + Yo, ™1 + T2), (18) 


a special case of model (14) with C = [A? YI]", 7 = [o7 93)", D = [1 1]", e = [e1 ee]? and 
P= Oi T,. The corresponding joint model (1) is 


1 1 2 2 2 
R €1R €2R Tre Tra O TRR TRA Tro 

1 1 9 2 2 
Al=]@a/]+ | €4|]~N10,| tarp Ta, 0} 4+] Tar Tan Tar (19) 
Y, 0 €9 0 0 0 Rea Tey, “Wes 


Ad 2 al ee = 
where Ver = Tap ter and Vac = [Tea +T Ra Mro| and Veo = 0 0 é 5 
OA» -W99 


We can see now that the desired model (18) constrains the joint model (19) by Var = 
Vy Vooy +7 +72 and Veo = y' Voc such that 


1 2 2 
Taa O us TAA oa 


a0 y a0 
ma=bt ofl 9 | [Bf tm bkal=bran| "of, Go) 
2 T OT TAA TA V1 2 2 1 Eee & TAA 749 
TRR In | m4 7, ‘9 T 72; TRA TRo| =(% 12] m2, m2, sy) 


To see how many constraints the desired model (18) has placed on the joint model (19), 
the constrained model (18) identifies p; + po + 1 parameters while the unconstrained joint 
model (19) identifies 2p; + pg parameters in (Ver, Vrc). Therefore, the constrained model 
(18) has p; — 1 fewer parameters than does the unconstrained joint model (19). The key 
constraints occur in the variances and covariances (20) and (21) where the association 7 
between R and A is constrained to be the same at both levels. An alternative form of the 
unconstrained model (19) replaces 7; with 7; in the equations (20) and y, with 72 in the 
equations (21). This would allow the association between R and A to be different at the 
two levels, inducing what is known in the social science and public health applications as 
a contextual effects model (Shin and Raudenbush, 2010). The constraints (20) and (21) 
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impose 712 — 711 = 0, that is, no contextual effects. 


4.3 Random-Coefficients Model 


As the number pp of random coefficients increases, the number of potentially extraneous 
parameters generated will increase non-linearly if no constraints are imposed. To show how 
aggravated the over-identification can become, consider a random coefficients model that 
adds level-1 covariates Ez having random coefficients to the model (18) 


R= A +Yy 72+ €1 + Does ~ N(A 1 + Yo42, 71 + Dz T2D2), (22) 
another special case of model (14) for C = [A? YJ]?, y = [nd ]?, D = [D, DF), 
1 e€ Ti T 
£. T)\T 2 ps8 a 7 _ | €20 _ | 7200 7201 
e = [e1 e3]° and tT = @ja1 7 where D; = 1, Dz = Bs | eg = Bay [ee Pee an | 


and pp = 1+ pp2. The corresponding joint model (1) is 


T 1 1 T_2 T_2 T 2 
R €1R Ds €2R TRR TRA 0 D3 TepDo Ds TRA Ds TR 
= 1 1 2 2 2 
A =>) sega. || Ae €2A4 ri IN |.0,.| dag Raa. Oo) TARD2 TAA TAQ 
Y5 0 66 0 0 O M3 pDo 35 Ts 


for €or = we | n= Toro Tory | eee Thoa | and a5 = Tho2 } Note that 
€2ri |’ Triro Triri |’ TRA TRI12 

Var = Thre + D3 72pDe2, Vac = [tha + Do 7},4 DF 7}5] and Voc as in the model (19). The 

desired model (22) implies constraining the joint model (23) by Ver = y7 Vecy+m1+Di D2 

and Vac = y' Voc such that 


1 1 
T_T TT 0 V1 T_T TT 0 
TRR = en 72 | a 0 | ‘9 | a i [TRA 0] = en 72 | ae 0 | , (24) 
ne _ on 42) TAA TA9 1 aes [7? mn ~ T r TAA 749 (25) 
RORO 1 Y2 m2, nr, ‘9 200; |" ROA 7 RO2 V1 Ye m4 m2, ) 
T ROR = 7201, TRIRI = T211; TRA = 0, Tino = 0. 


To see how many constraints the desired model (22) has placed on the joint model 
(23), the constrained model (22) identifies p, + po + pp2(pp2 + 1)/2 parameters while the 
unconstrained joint model (23) has p; + pp2(pp2 + 1)/2 + ppe(pi + pe — 1) components in 
(Ver, Vac). Therefore, the model (22) has pp2(p1 + p2 — 1) — po fewer parameters than does 
the unconstrained joint model (23). Again, the key constraints occur in the variances and 
covariances (24) and (25) where not only is the association y, between R and A constrained 
to be the same at both levels, but the covariance components 77,4 and 7%. that yield 
extraneous interaction effects between E> and C' are also set to zero. That is, the desired 
model (22) has no contextual effects of A and no interaction effects between EF, and C’. Next, 
we extend the model (14) to an arbitrary number of Q levels. 
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(23) 


4.4 (-Level Model 


We now focus on how to efficiently estimate the hierarchical GLM (14) for the arbitrary num- 
ber of Q levels. Unlike the single-level case, however, the joint distribution (1) over-identifies 
the desired model (14). This over-identification poses a major computational challenge as it 
represents the components of cov(R,C) = Vrc that are extraneous for subsequent analysis 
and that rapidly multiply as Q, p and pp increase. The consequence is that estimation of 
the over-identified hierarchical model (14) may produce substantially biased inferences in 
the case of MLE on Y,»; or computational problems in the case of MLE on Y™. 

To show the over-identification explicitly, we reexpress all random effects of the joint 
model (1) in Table 1 according to the decomposition Y = [R C?]* as listed in Table 2. 
Column gq lists level-¢ unit-specific random effects €, ~ N(0,7,) that may now be partitioned 


Table 2: All random components of Y in Equation (1). 


fh <9 59) “aes, SO) 
RK. |\ in GR. 3p = OR 
A l€1A €A €34 *** €QA 
Yo €92 +€32 *** €@Q2 
Y3 €33, *** -€Q3 
YQ €QQ 


as 


€qR The TRC Oy oe caer ie 

q 

Eq = cc for ego = |€g4 €72°°* Gaal 

q ’ q q qd qd qA *q2 qq 
€qc Tor "co 


The random effects (€gr, €gc) and their variances and covariances (thp, Téc, Tec) are useful 
for explaining the over-identification problem and the constraints. Notice that the level-q 
unit-specific €, generates ppq(d-/_; pr — 1) covariance components in tho. For Q = 3 with 
three columns, for example, columns 1, 2 and 3 generate ppi(pi — 1), po2(pi + pz — 1) and 
Pp3(pi + pe + p3 — 1) components in cov(eir, €1c) = Tro = Tra, COU(C2R, 2c) = Tao = 
[73.4 Th] and covu(esr, €3c) = Tho = [THA Teo Ths] at levels 1, 2 and 3, respectively, for 
E10 = €1A, €20 = [654 ed]? and esc = [ed €45 €43]7. Overall, the random effects of the joint 
model (1) produce Yige1 PDq 711 Pp — pp covariance components between R and C' while 
the desired model (14) implies p — 1 elements in y. Consequently, the potential for severe 
over-identification exists if no constraints are imposed on Equation (1). 

A key task, then, is to formulate a general approach to imposing constraints, one that 
applies to any value of Q and any number of covariates. The following theorem shows a 
conditional model the joint model (1) identifies so that constraining the joint model amounts 
to constraining the conditional model. Let ec = (€1¢, €2c,:++,€gc) and m¢% be the inverse 
of tho. 
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Theorem 4.1 Joint model (1) represents ye Pbdqd2=1 Pr — Pp covariance components be- 
tween R and C and is a one-to-one transformation of Rlec, D 


Q 
R = So (DIV geo + DP5q), 54 ~~ NO, Tic) (26) 
q=1 


ag eG i, ated @ ar 
for Tg =Trotcc and Trio = Tre —VgteclG- 


For each covariate in Y,, the conditional model (26) expresses the association between 
the covariate and R to be distinct at each level s > q while the desired model (14) rep- 
resents a single effect of the covariate on R. Consequently, the joint model (1) produces 
4 PDq =1 Pr — Pp — (p—1) parameters extraneous for subsequent analysis. The extrane- 
ous parameters, representing the contextual effects of C' and the interaction effects between 
D and €c, rapidly multiply as Q, p and pp grow. Let D, = [1 Ej]" and €gr = [gro €;ril’ 80 
that tho = cou(€gr, qc) = [toc Tcl’. The following corollary to Theorem 4.1 establishes 
one-to-one correspondence between a general contextual effects model and a constrained 
joint model (1) where each level-q covariate has a distinct effect at every level s > q without 
involving interaction effects. 


Corollary 4.2 Joint model (1) under constraints 
Thc =9, Vq (27) 


identifies a general contextual effects model given ec and D 


Q 
R= (yer eee + DI5q), 5g ~ NO, Tye) (28) 
q=l 


Jor s5 = lia ag 


xT q 
Equation (28) is nested within Model (26) for Ty = 1 | = bie | Toe. We define 


the contextual effects of covariates in Y, at level s > q as Yqs — Yq(s—1) (Shin and Raudenbush, 
2010). Corollary 4.2 may involve 7; expressing constraints of different forms. For example, if 
it is desirable for each covariate in A to have a single effect on R in the model (28), then the 
corollary would constrain 77 = [v7 Nay .- le for all q in addition to the constraints (27). 
For another example, if the contextual effects of A are desired at level 2 but no other levels in 
the model (28), then the additional constraints would be yf = 711 and ¥7 = [yf 73q°°* aq)" 
for q > 1. Shin and Raudenbush (2007) imposed 7%,~ = 0 and a single effect of each 
covariate on the joint model (1) to identify the desired model (14) for Q = 2. The following 
corollary to Theorem 4.1 establishes the one-to-one correspondence between the model (14) 
and a constrained joint model (1). 


Corollary 4.3 Joint model (1) under constraints 


TRIC =0 and T ROC a li - _ Ya the, Vq (29) 
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identifies hierarchical model (14). 


Model (14) under the constraints (29) is equivalent to model (28) for Y= [v7 - - aval that 
implies yf We) era = WEA, YF TO, ers = VTYs and Ty = 14) for all s > 1 and all q. 

Given Q-level incomplete data, Equation (1) under constraints (29) identifies hierar- 
chical model (14) while the joint model under partial constraints [, such as constraints 
(27) may identify desired contextual effects of C’ or interaction effects between D and ec 
(Shin and Raudenbush, 2010). All these applications may be carried out via MLE on Yous, 
MLE on Y™ ora hybrid method of imputation following estimation of the constrained joint 
model. The choice will depend on computational feasibility and the goal of the application. 
Given an analyst’s model (14), MLE on Y,,, constrains the joint model (1) to just iden- 
tify the analyst’s model while MLE on Y™ is more generally applicable by enabling the 
data analyst to explore, in addition, contextual effects of C' and interaction effects involving 
D. Consequently, MLE on You; is tailored to estimation of the analyst’s model whereas 
MLE on Y™ estimates an overidentified joint model so that it enables the data analyst to 
explore multiple hierarchical models for correct specification of the analyst’s model. When 
MLE on Y™ is desired, but produces an unconstrained joint model (1) that is extremely 
high dimensional and thus difficult to estimate well, the hybrid method enables estimation of 
fewer parameters and thus reduces computational burden in estimation by imposing partial 
constraints such as Equations (27) on the joint model. 

Now, we consider a more general model (14) 


R=C'y+W'w+Dte, e~ N(0,7) (30) 
for known covariates W = [Wi Wy---Wé]" having fixed effects y = [yd1 Ybo°-* Yeo!” 
and every other component defined identically as the counterpart of the model (14) where 
level-g covariates W, have fixed effects 7, . The corresponding joint model (1) has X, = 
Ip, ® [we Wha ee, Wé | and everything else the same as previously defined. 

The next section illustrates an application to three-level large-scale survey data subject to 
missingness at all levels. We illustrate MLE on Y™, which is more generally applicable than 
MLE on Yn», and the hybrid method. Estimation and multiple imputation are carried out 
by C programs written by the first author. The imputation program uses a random number 
generating library of C routines, RANDLIB 1.3 by Barry W. Brown, James Lovato, Kathy 
Russell and John Venier. Analysis of imputed data and complete-case analysis are carried 
out by HLM 7 (Raudenbush, Bryk, Cheong, Congdon, and du Toit, 2011). The convergence 
criterion is the difference in the observed log-likelihoods between two consecutive iterations 
less than 10~°. The statistical significance is discussed at a significance level a = 0.1. The 
user-friendly two-level program that implements MLE on Y™ is expected to be released 
to the public in software package HLM 7 in the year 2014. The user-friendly three-level 
program is under development at the time of writing this manuscript. 


5 Illustrative Examples 


In this section, we aim to identify the determining factors of body mass index (BMI) during 
childhood that may span three levels, occasions nested within a child attending a school, via 
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analysis of the Early Childhood Longitudinal Study Kindergarten Cohort of 1998 (ECLS-K, 
Tourangeau et al., 2009). Specifically we consider ethnic and social disparities in the growth 
of BMI, and ask how environmental exposures such as television watching and school quality 
are associated with growth in BMI. 

The ECLS-K is a nationally representative sample of 21260 kindergartners in the United 
States who attended 1018 schools in 1998. The study followed the children in fall-kindergarten 
(K) of 1998, spring-K of 1999, fall-first grade (G1) of 1999, spring-G1 of 2000, spring-third 
grade (G3) of 2002, spring-fifth grade (G5) of 2004 and spring-eighth grade (G8) of 2007. 
Due to cost constraints, a random subsample (41% to 54%) of students transferring schools 
were followed from K to G5. Furthermore, the fall-G1 data collection was limited to 27% of 
base-year students in a 30% subsample of the schools. Therefore, the ECLS-K contains many 
item- and unit-nonresponses. For example, only 5044 first graders had their BMI measured 
in fall of 1999. Consequently, researchers have analyzed the ECLS-K without the third wave 
in longitudinal studies of obesity (Gable et al., 2007, Bhargava, Jolliffe, and Howard, 2008, 
Danner, 2008). With the G8 data available since 2009, the longitudinal analysis demands 
challenges as less than 7% of the children attended the same school from K to G8. 

A longitudinal analysis of the ECLS-K should involve all seven waves of data to yield ef- 
ficient analysis. Furthermore, missing data may be present at multiple levels. The approach 
in this paper enables all waves and available data to be analyzed for efficient and unbiased 
inferences. The “all available data” include children with item- as well as unit-nonresponse 
because a child with time-varying characteristics missing but individual or school charac- 
teristics observed strengthens inferences at higher levels (Shin and Raudenbush, 2011, Shin, 
2012). Mobile students transferring schools are nested within their original schools in fall-K 
and analyzed. 

Following the previous studies of the ECLS-K (Datar and Sturm, 2004, Sturm and Datar, 
2005, Danner, 2008, Bhargava et al., 2008), we analyze the raw BMI as a ratio of body weight 
in kg to height in meters squared. Table 3 summarizes the data for analysis of 21,210 children 
who attended 1,018 schools in 1998 after dropping 50 children with most characteristics 
missing including gender and race. Also dropped are 6 eighth-grade BMIs ranging 98 to 207 
that are influential on the fitted regression and 13 extraneous heights and weights such as a 
20-pound weight and a height reduced by more than 10 inches. After dropping the influential 
and extraneous observations, the standard deviation of G8 BMIs reduces from 6.29 to 5.29. 
With 7 occasions nested within most children, there are a total of 148,451 occasions at level 
1 nested within 21,210 children at level 2 attending 1,018 schools at level 3. BMI and the 
daily number of hours spent watching television (TV) are time varying (Gable et al., 2007, 
Bhargava et al., 2008, Danner, 2008). The BMI ranges 7.1 to 57.5. To produce TV, maximum 
daily television viewing hours exceeding 7 per weekday and 10 per weekend day were set to 
7 and 10 hours, respectively. TV was measured in spring-K for the first time and then once 
in every other data collection. It is unreasonable to think that the TV values between fall-K 
and spring-K for each child are different enough to treat all the values missing in fall-K. The 
analysis uses the TV measured in spring-K for the first two data collections. In addition, the 
analysis considers six dummy time indicators from spring-K to G8 to control for the natural 
growth in BMI at level 1; base-year home neighborhood safety (HOMESAFETY), base-year 
age in months (AGE), birth weight in pounds (BIRTHWEIGHT), base-year socioeconomic 
status (SES), a female indicator (FEMALE) and six race ethnicity indicators at child level or 
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Table 3: Data for analysis. K and Gin stand for kindergarten and nth grade respectively. 


Level | Variable Description Mean (SD) Missing (%) 
Time | BMI body mass index 
Fall-K 16.27 (2.20) 2219 (10) 
Spr.-K 16.40 (2.30) 1450 (7) 
Fall-Gl 16.62 (2.60) 16170 (76) 
Spr.-Gl 16.90 (2.86) 5805 (27) 
Spr.-G3 18.66 (3.88) 7476 (35) 
Spr.-G5 20.57 (4.75) 10241 (48) 
Spr.-G8 22.80 (5.29) 12450 (59) 
TV daily TV viewing in hours 
Fall-K 2.01 (1.08) 2772 (13) 
Spr.-K 2.01 (1.08) 2772 (13) 
Fall-Gl 2.44 (1.65) 16374 (77) 
Spr.-Gl 1.88 (1.25) 6018 (28) 
Spr.-G3 1.91 (1.22) 8123 (38) 
Spr.-G5 2.05 (1.22) 10468 (49) 
Spr.-G8 3.09 (1.87) 12227 (58) 
Child | HOMESAFETY | safety around home 1.66 (0.55) 2316 (11) 
AGE age in months 68.41 (4.35) 2132 (10) 
BIRTHWEIGHT | birth weight in Ib 6.91 (1.35) 1472 (7) 
SES socioeconomic status 0.00 (0.80) 1088 ( 5) 
FEMALE 1 if female 0.49 (0.50) 0 (0) 
BLACK 1 if African-American 0.15 (0.36) 0 ( 0) 
HISPANIC 1 if Hispanic 0.18 (0.38) 0 (0) 
ASIAN 1 if Asian 0.06 (0.25) 0 (0) 
PACIFIC 1 if pacific islander 0.01 (0.10) 0 ( 0) 
ALASKAN 1 if American Indian/Alaskan | 0.02 (0.13) 0 ( 0) 
OTHER 1 if multiracial or others 0.03 (0.16) 0 ( 0) 
School | GRAFFITI Graffiti around school 0.55 (0.72) 262 (26) 
PRIVATE 1 if private school 0.26 (0.44) 0 ( 0) 


level 2; and base-year school neighborhood safety (GRAFFITI, the amount of graffiti around 
school) and a private school indicator (PRIVATE) at school level or level 3. 

An unsafe neighborhood is associated with elevated BMI among adults (Shin and Rau- 
denbush, 2007). HOMESAFETY has three scales: not safe or low (0); somewhat safe or 
medium (1); and very safe or high (2) while GRAFFITI, the lower the safer, has four scales: 
none (0); a little (1); some (2); and a lot (3). Preliminary analysis shows that higher-order 
than linear association between the safety factors and BMI is unlikely. BMI and TV miss 
38 % and 44 % of their values overall, and 76% and 77% in fall-G1, respectively. HOME- 
SAFETY, AGE, BIRTHWEIGHT and SES miss 5 to 11 % while GRAFFITI is missing for 
26% of the schools. Complete-case analysis entails removing the 262 schools with missing 
GRAFFITI and dropping 5,381 students attending the schools and their data from analysis. 
The resulting inference is inefficient and may be considerably biased as will be illustrated 
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below. The 2,166 missing birth weights in fall-K were recovered from later data collections. 
The 21,210 students are 49% female and 55 % white. Out of the 1018 schools, 26 % are 
private. The mean BMI grows with acceleration until G5. 


5.1 Random Intercept Model 


The analysis aims to efficiently identify the environmental factors of childhood BMI such 
as television watching and school quality after controlling for natural growth as well as eth- 
nic and social disparities in BMI. At level 1, over time nested within children, we model 
change in BMI as a function of incompletely observed TV and time. At level 2, between 
children nested within schools, we include incompletely observed measure of the safety 
around the home, age, birth weight and socioeconomic status, and completely observed 
female and race ethnicity indicators. At level 3, between schools, we include an incom- 
pletely observed measure of the safety of the school by GRAFFITI and a completely ob- 
served indicator for private school. In terms of our general model (30), we therefore have 
Q=3,R=BMI,A=TV,Y/ =[HOMESAFETY AGE BIRTHWEIGHT SES, 
Yo = GRAPFITI, D) = 1.Ds =] 1. De = 1 WS [Fo 8 A ee VO: TT), 
Ws =(FEMALE BLACK HISPANIC ASIAN PACIFIC ALASKAN OTHER] 
and Wi = [1 PRIV ATE] where T2 through 77 are dummy indicators for spring-K through 
spring-G8. Rather than subjecting the mean growth in BMI to a polynomial curve (Bhar- 
gava et al. 2008; Danner 2008), W, controls for it as the difference in mean BMIs between 
each time point and fall-K. 

Table 4 displays the output. Age and birth weight have been centered around the re- 
spective sample means. The complete-case analysis of the desired model (30) under column 
heading “CC” involves 12446 children attending 706 schools who have both BMI and TV 
observed at one or more occasions. The children have a total of 55082 occasions. The 
next column under “MLE on Y™” presents the MLE on Y™ that uses five imputa- 

1 1 
tions based on the corresponding unconstrained joint model (1) for 7, = TRR "RA | 
TAR TAA 
Tap Tra Wo Tre Tra Tho Ts 


3 3 a is 3 3 
Ti Mig, Wage 0 Tae oT 
TS | age aac Aa || ea Sa at | ee ee | Cee | or respective 
he ee Tor "Moa Mop Tog Temes See 
2R Ta M9 B48 


T3734 32 733 

dimensions 2-by-2, 6-by-6 and 7-by-7. The number of covariance components between R 
and C is Dey PDq d-=1 Pr — Pp = 12 while the desired hierarchical model has p — 1 = 6 
effects of C on R. Consequently, 6 covariance components of the unconstrained joint model 
are extraneous for subsequent analysis. An asterisk ‘*’ marks statistical significance at a 
significance level a = 0.1. The CC standard errors are 8 to 83% higher than those under 
the MLE on Y™. Under the CC, females have 0.09 units higher than do males, and pacific 
islanders and students of multiple or other races have 1.16 and 0.27 units higher than do 
white counterparts, respectively, in BMI on average while these effects are insignificant un- 
der the MLE on Y™, controlling for other covariates. The effect estimates for females and 
pacific islanders under the CC are 25 and 5 times higher than their counterparts under the 
MLE on Y™. The effect estimates for TV, SES, BLACK, ALASKAN and OTHER under 
the CC are also 10 to 67 % higher than their MLE on Y™ counterparts. The CC variances 
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are comparatively underestimated. 


Table 4: Random-intercept model (30) by complete-case analysis (CC), by the MLE on Y™ 
and by the MLE on Y.», based on the unconstrained joint model (1), and random-coefficients 
model (30) by the MLE on Y™. Statistical significance marked by ‘*’ at a significance level 
a=0.1. 


Estimate (Std. Err.) 
CC MLE onY™ | MLE on Yops MLE on Y™ 

Covariate Unconstrained Random Slope 
Intercept 15.88 (0.11)* | 15.94 (0.08)* 11.59 (0.37)* 15.93 (0.09)* 
TV 0.10 (0.01)* 0.08 (0.01)* 0.18 (0.01)* 0.09 (0.01)* 
T2 0.13 (0.02)* 0.13 (0.02)* 0.13 (0.02)* 0.13 (0.02)* 
T3 0.34 (0.04)* 0.32 (0.03)* 0.28 (0.03)* 0.33 (0.03)* 
T4 0.63 (0.03)* 0.63 (0.02)* 0.64 (0.02)* 0.63 (0.02)* 
T5 2.36 (0.03)* 2.36 (0.02)* 2.37 (0.02)* 2.36 (0.03)* 
T6 4.21 (0.03)* 4.24 (0.02)* 4.23 (0.02)* 4.23 (0.02)* 
T7 6.38 (0.03)* 6.40 (0.02)* 6.29 (0.03)* 6.39 (0.03)* 
HOMESAFETY | -0.01 (0.05) 0.00 (0.04) 0.01 (0.04) -0.01 (0.04) 
AGE 0.03 (0.01)* 0.03 (0.00)* 0.03 (0.00)* 0.03 (0.00)* 
BIRTHWEIGHT | 0.31 (0.02)* 0.32 (0.01)* 0.32 (0.02)* 0.32 (0.01)* 
SES -0.30 (0.04)* | -0.27 (0.03)* -0.28 (0.03)* -0.27 (0.03)* 
FEMALE 0.09 (0.05)* 0.00 (0.04) 0.02 (0.04) 0.01 (0.04) 
BLACK 0.52 (0.09)* 0.47 (0.06)* 0.35 (0.06)* 0.48 (0.07)* 
HISPANIC 0.61 (0.08)* 0.62 (0.06)* 0.55 (0.06)* 0.62 (0.06)* 
ASIAN 0.06 (0.13) -0.07 (0.09) -0.09 (0.09) -0.10 (0.09) 
PACIFIC 1.16 (0.37)* 0.25 (0.20) 0.16 (0.20) 0.31 (0.20) 
ALASKAN 0.66 (0.20)* 0.53 (0.17)* 0.47 (0.16)* 0.53 (0.17)* 
OTHER 0.27 (0.16)* 0.16 (0.12) 0.15 (0.13) 0.16 (0.13) 
GRAFFITI 0.04 (0.05) 0.00 (0.04) 0.00 (0.04) 0.00 (0.04) 
PRIVATE -0.03 (0.07) -0.07 (0.06) -0.04 (0.06) -0.07 (0.06) 
T3 0.15 (0.03) 0.17 (0.02) 0.20 (0.03) ea ea 
T2 6.95 (0.10) 7.01 (0.08) 6.98 (0.08) 6.98 (0.07) 
Ty 3.08 (0.02) 3.07 (0.02) 3.05 (0.02) 3.07 (0.02) 


The estimated natural growths of BMI under CC and MLE on Y™ are close to each 
other. From the MLE on Y™, a white student having mean age, birth weight and socioeco- 
nomic status who does not watch TV has 15.94 BMI units on average in fall-K. The 6-month 
BMI growth is 0.13 units in spring-K and then accelerates to 0.19 units (0.32-0.13) in fall-G1, 
to 0.31 units (0.63-0.32) in spring-Gl, to 0.43 units [(2.36-0.63)/4] until spring-G3 and to 
0.47 units |[(4.24-2.36)/4] until spring-G5, and then decelerates to 0.36 units [(6.40-4.24) /6] 
until spring-G8. A polynomial curve may not reveal the details in growth. Controlling for 
the natural growth, and demographic individual and organizational school characteristics, a 
one-hour increment in daily TV viewing elevates child BMI by 0.08 units on average, 64% of 
the 6-month BMI growth in kindergarten. Each month in base-year age and an additional 
pound in birth weight contribute to 0.03 and 0.32 unit increases in BMI, respectively, while 
one unit increase in SES lowers the child BMI by 0.27 units on average, ceteris paribus. 
Black, Hispanic, and American Indian or Alaskan students have 0.47, 0.62 and 0.53 units 
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higher, respectively, than do white counterparts in BMI on average, controlling for other 
covariates. 

The next column under “MLE on Y,,, Unconstrained” illustrates how biased the result- 
ing inferences may be relative to those of the desired model (30) under the MLE on Y™ if 
we are to directly transform the corresponding unconstrained joint model (1) to the desired 
model. The transformed parameters are 


3 q 2 3 3 3 q 
Paes TAA TagtT Tag TAZ v1 Me TAR 
2 3 2 3 3 = 2 3 
TAT, T2192 M3 y2 | = | Tor - TOR | > (31) 
3 3 3 
3A 7139 7133 3 T3R 


Tha Tar || V1 
n= tha ~ TBttha Te = he - In ofl| TH 78 || 
2A 22 2 
compare the estimates to the counterparts under the MLE on Y™. Most strikingly, the key 
environmental effect of television watching (TV) under investigation increases by 126% to 
0.18. The gaps in mean BMIs of black, Hispanic and American Indian or Alaskan students 
relative to white students are noticeably underestimated, and so are the intercept and the 
effects of T3 and T7. The level-1 and -2 error variances are understated while the level- 
3 error variance is over-represented. This example is comparatively benign with only one 
covariate, TV, at level 1 and four covariates at level 2 subject missingness. With more 
covariates subject to missingness at nested levels, this method has the potential to produce 
severely biased inferences. To correctly apply the MLE on Y,», for the desired hierarchical 
model (30), the transformation should follow estimation of the joint model under constraints 
2 2 2 
(29): Tar = TaaN1s vee | = "AA A | n | and tép = TAcy according to Corollary 
UD) TA 729 Y2 
4.3. To see how the constraints work, replace the right hand side in Equation (31) with the 
constraints. 


| and 73 = Tp — y' tac. We 


5.2 Random-Coefficients Model with Missing Data 


The analysis above reveals that black, Hispanic, and American Indian or Alaskan students 
have elevated BMIs relative to white counterparts on average controlling for other covariates 
in the model. The minority students may attend lower-quality schools than those that white 
counterparts attend which, by hypothesis, contributes to the disparity in BMI. School quality 
may be indicated by school characteristics such as school safety, school-mean socioeconomic 
status, contents of school meals, physical education time and school sector (Datar and Sturm 
2004; Gable et al. 2007; Bhargava et al. 2008). If this hypothesis is true, then the minority 
students may have a randomly varying effect on BMI across schools of different qualities. 
Among the minority students, Hispanic students stand out in BMI. Overall, Hispanic stu- 
dents are half as likely to attend private schools as white students. They also attend schools 
having about three times as much graffiti around as those that white students attend on 
average. 

The random-intercept model above is extended to a random-coefficients model where 
the Hispanic indicator has a random effect on BMI across schools. The desired model (30) 
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= d _ | 7300 7301 ? : 
has D3 = HISPANIC I B= ee oe | and all other components identical to those 


of the random intercept model. The corresponding unconstrained joint model (1) has an 
Troro TRori TRoc 
8-by-8 covariance matrix 73 = | Tero Trini Teic | and all other parameters identical 
Tero Ter Tec 
to those of the joint model corresponding to the random-intercept model. The complete- 
case analysis produced virtually identical estimates as those under the CC in Table 4. This 
is not surprising in that the slope of HISPANIC does not vary significantly across schools 
(variance estimate=0.23, standard error=0.17, p-value=0.24). The MLE on Y™ based 
on m = 5 imputations yields the estimates under “MLE on Y™ Random Slope” in the 
Table. Except for the level-3 covariance matrix, the estimates are close to the counterparts 
under the MLE on Y™. The slope for the Hispanic indicator seems to vary at most 
modestly across schools (slope= 0.18, standard error= 0.12). To test the null hypothesis 
that 731; = 0 which implies 739; = 0, let Of and Or be the parameters of the random- 
coefficients (full) and -intercept (reduced) models, and 6% and 6% be the ML estimates, 
respectively, given the tth imputation for t = 1,---,m. The log likelihoods 1(6‘,) and (6) 
evaluated at the ML 64, and 64,, respectively, given the tth imputation yield d; = 2{l(6%) — 
1(6*2)). The test statistic recommended by Li, Meng, Raghunathan, and Rubin (1991a) 
= [d/2 —(m+1)(m— 1)-*r] /(1 +r) = 0.53 where d = 2, d,/m and r = (1+ 


m1) ba (va = va) /(m— 1) for Vd = ym, Vd:/m (Schafer, 1997). The p-value is 


P(Fy, > D) = 0.59 where F,, is a random variable from the F distribution with 2 numerator 
and v denominator degrees of freedom for vy = (m—1)(1+1/r)?/23/" = 974. This test: yields 
an approximate range of p-values between twice and one half the computed value (Li et al., 
1991a, Schafer, 1997). The computed p-value 0.59 gives enough precision to conclude the 
random intercept (null) model. Therefore, we do not find evidence that the slope for the 
Hispanic indicator varies randomly across schools. 

The unconstrained joint model identifies 18 covariance components between R and C 
while the desired random coefficient model has 6 effects of C’ on R. Consequently, 12 
covariance components between R and C’ are extraneous for subsequent analysis. Con- 


straints to identify the desired model via MLE on Y,»5 are Thr = Ais a | = 


Ta TA v1 
ae ae Ton = tooy aud a6 = 0 by Corollary 4:3. 
M24 99 Y2 


6 Discussion 


This paper presented methods for efficient and unbiased analysis of a @-level hierarchical 
general linear model given incomplete data with a general missing pattern at any of the Q 
levels. Our general approach uniformly expresses the Q-level model for Q > 1 that greatly 
facilitates extension of existing single-level and two-level efficient missing data methods to 
general Q-level data; reexpresses the desired model as a joint distribution of the variables, 
including the outcome, that are subject to missingness conditional on all of the covariates 
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that are completely observed; and efficiently estimates the joint distribution. This approach 
confronts two major challenges. As the number of Q levels, the number, p, of variables 
subject to missingness and the number, pp, of random coefficients increase in the hierar- 
chical model, the joint distribution may become extremely high dimensional and difficult to 
estimate well. Moreover, the joint model, in general, over-identifies the desired hierarchical 
model. The problem of over-identification can grow severe as levels, covariates, and random 
coefficients are added to the hierarchical model. The consequence is that the overidenti- 
fied hierarchical model may produce considerably biased inferences as was illustrated in this 
paper. To overcome the computational challenges, we derived, within each iteration of the 
EM algorithm, recursive Q-step computation formulas for efficient estimation of the joint 
distribution where computation at each step involves single-level data only given higher-level 
computation components. The consequence is efficient computation that is not excessively 
burdened with regard to Q, p and pp. Furthermore, we showed how to impose constraints on 
the joint distribution within the framework of the Q-level hierarchical model in a way that 
is uniform without regard to Q; and in a way that produces unbiased and efficient analysis 
of the hierarchical model. 

This paper considered three methods for efficient handling of missing data: MLE on Yous, 
MLE on Y™ and the hybrid method. Given a Q-level hierarchical model with incomplete 
data, MLE on Y,», constrains the joint model to be a one-to-one transformation of the 
desired hierarchical model for unbiased analysis; efficiently estimates the constrained joint 
model by ML; and transforms it to the hierarchical model. Consequently, MLE on Yoy5 is 
tailored to estimation of the desired hierarchical model. On the other hand, MLE on Y™ 
generates multiple imputation given the unconstrained joint model estimated by ML, allow- 
ing the user to impose the desired constraints when using conventional software to analyze 
the imputed data. Therefore, MLE on Y™ is more generally applicable by enabling the 
data analyst to explore, in addition, contextual effects and interaction effects involving co- 
variates having random coefficients. Theorem 4.1 provides the scope of hierarchical models 
that can be explored under the joint model. When MLE on Y™ is desired, but produces an 
unconstrained joint model that is extremely high dimensional and thus difficult to estimate 
well, the hybrid method of imputation following estimation of a partially constrained joint 
model enables estimation of fewer parameters and thus reduces computational burden in 
estimation. The Corollaries to Theorem 4.1 provide the scope of hierarchical models that 
may be explored by the data analyst under the partially constrained joint model. 

We have illustrated the MLE on Y™ by a longitudinal analysis of ECLS-K for Q = 3 
and compared it to the complete-case analysis that was shown to be relatively inefficient and 
subject to biased inferences. We have also compared the MLE on Y™ to the MLE on Yous 
based on the unconstrained joint model that revealed the potential to produce substantially 
biased inferences. The proliferation of extraneous parameters was comparatively moderate 
with Q = 3, p=7 and pp = 3. We have imposed six constraints to generate the results via 
the MLE on Y™ in Table 4. For a model having larger Q, p or pp, it may be desirable 
to use the hybrid method that reduces the computational burden of MLE on Y™ and, 
at the same time, broadens the scope of MLE on Y,», by allowing an analyst to explore 
multiple hierarchical models for the correct specification of the desired model. We may also 
take advantage of extra variables not of direct interest in the desired hierarchical model, 
but highly correlated with variables subject to missingness to more precisely impute missing 
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data at multiple levels (Shin and Raudenbush, 2007). 

The ECLS-K has a great majority of students transferring schools. Thus, it may be 
more appropriate to consider a cross-classified model for the analysis that relaxes the strict 
hierarchy of nesting a student within a single school. Estimation of a cross-classified model 
is challenging because the growths in the outcome of students while attending the same 
schools become dependent to produce a complicated network of dependence among children 
and schools (Raudenbush and Bryk, 2002). All schools for each child and all children for 
each school may have to be analyzed at once to fully account for the dependence. With many 
covariates subject to missingness at multiple levels, the joint model of variables subject to 
missingness may be too highly dimensioned to estimate well. Therefore, a valuable future 
research topic is development of a method for efficient estimation of a cross-classified model 
given incomplete data. 

One restriction of the general @-level hierarchical model is that the covariates having 
random effects should be completely observed. If the covariates are subject to missingness, 
they should appear on the left hand side of the corresponding joint model for efficient han- 
dling of the missing data as well as on the right hand side of the model for estimation of the 
random coefficients. Such a joint model is not multivariate normal, and the factorization 
under joint normality that leads to the desired conditional hierarchical linear model does not 
apply. Consequently, the ML approach is challenging. Relaxing this assumption is beyond 
the scope of the current paper. 

It took 21 seconds to complete each iteration in estimating each joint model in Table 4 
on a 2.8 GHz laptop computer that has 8 GB memory. The estimated random intercept 
and coefficients models took more than 5 hours to converge at 867th and 894th iterations, 
respectively. No attempt to accelerate the convergence has been made. The convergence 
criterion is the difference in the observed log-likelihoods between two consecutive iterations 
less than 10~°. In some of our two-level test runs, we compared computation times for esti- 
mation of a joint model between our program with a convergence criterion of the difference 
in log-likelihoods between two consecutive iterations less than 10~° (Shin and Raudenbush, 
2007), and an alternative program with a convergence criterion of the percentage difference 
in log-likelihoods between two consecutive iterations less than 10~° and the Aitken accel- 
eration (Aitken, 1926), the alternative program converged not only to practically identical 
estimates and standard errors, but up to 90% faster than did our program in terms of the 
number of iterations. Considerable saving in computation time is anticipated with the like- 
wise acceleration in the three-level applications. It took us about 2 minutes to generate a 
single imputation for the results in Table 4. So far, we have developed a three-level pro- 
gram implementing the MLE on Y™ only and thus cannot compare the computation times 
between MLE on Y™ and MLE on Yojg.- 

In Section 5.2, we found no evidence that the slope for the Hispanic student indicator 
varies randomly across schools, based on the test statistic recommended by Li et al. (1991a). 
This test provides an approximate range of p-values between twice and one half the computed 
p-value. More accurate p-values may be obtained at the expense of extra computational 
effort (Li, Raghunathan, and Rubin, 1991b, Meng and Rubin, 1992, Schafer, 1997, Little and 
Rubin, 2002). Because the corollaries to Theorem 4.1 establish one-to-one correspondence 
between hierarchical model (30) and joint model (1), the MLE on Y,»; enables an alternative 
likelihood ratio test between the two analyst’s models directly based on their constrained 
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joint models. 

Our illustrative examples in section 5 are based on a large sample. The performance of 
our estimators in terms of bias and efficiency involving a small sample is yet to be assessed. 
Therefore, simulation studies on the small-sample performance of our methods will be a 
useful future research area. 

The analysis in this paper involved discrete covariates, the safety factors, subject to 
missingness at levels 2 and 3. Although it is improper for the normal linear joint model to 
describe the marginal distribution for the discrete factors, the implied conditional distribu- 
tion is the desired hierarchical model. An advantage is that it allows the discrete covariates 
subject to missingness to be analyzed by the efficient missing data method (Schafer, 1997, 
Shin and Raudenbush, 2007). In addition, the impact of the joint distribution assumptions 
on the desired conditional model by the MLE on Y™ is comparatively weak because the 
distributional assumptions do not affect the observed data. A valuable future extension of 
this approach is to a hierarchical generalized linear model given incomplete data. 


Appendix 
Proof of Proposition 3.1. The model (8) implies 
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Proof of Theorem 3.2. The initial step is to compute A®.,. and BY... in Equations 
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Proof of Proposition 3.3. Equation (8) implies 


T -1 x 
Fa oe emote am {domo ~ Z_qqmobs€—qqm) (34) 
mobs ~_ qt+1 G+ )T @q(qtl1)TQ-1/(z ps ; 
Eps —— The os oF Oil Caer = €_aqm) 
T —qll — —ql2 y-qt+1 
Ga _ ents Viniabs X qmobs Xiah mobs DG ibs (35) 
mobs —_ (xt VrP2 yar \r KX ADT —922 vat) 
qmobs " mobs ~* mobs mobs mobs ~*mobs 


26 


qll ql2 y-qt+1 
H4 _ Lies Vobs X gmobs Zana V ieee Aah (36) 
mobs ~~ (XT a Aas ye Z (q+1)T ee q22 yqtl 
qmobs ° mobs “mobs mobs mobs ~*mobs 
ae s : ql2 7q+1 : 
Where: Ons eens 6 ogne OL V4" in Equations (12), V597Z4*) in Equation (33), 
ql2 yqtl = = 1Q-l@a(a+1) pat 
Vimobs Na obs — Ua, ete ea en re neem 
Qt ))Ty7-Gq22 yqtl ey q+1)T ae YT 1 1 1 a(q+1) pyatl 
AGnobs Viioos De iabs _ Gabe +H or TOR - QQ, ee Ae OL Heaps 


: qtl qtl qt+l qt+1 qt1 
depend on-lével-@: 7 geviahsy: Yomnts 2G comais ONIN; BIVEN A py Be Gayl a ae oe pes le or 
and 0. | 


Proof of Theorem 3.4. The initial step is to compute A®,,, = Ome donee Be os = 
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Proof of Theorem 4.1. Table 2 reveals ey PDq /=1 Pr — Pp Covariance components 
in (who, Tac, +++, Tec) between R and C and implies E(€grlégc) = Vgéqc and var(€gr|éqc) = 
Tho for all g to identify model (26). Conversely, Equation (26) implies E(R|D) = 0, 
var(R|D) = 24 Di tkrD, and cou(R,C|D) = bm Dinh, Spo Di tie DEnRo| which, 
along with the marginal C, is one-to-one with the joint model (1). fj 
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cluster m and zeroes elsewhere for r < @ and Aggimobs = ZQqmobs for r = Q. Let yy be a 
vector of distinct elements in 7, so that y = [y§ yG_, --- y{]* and E, = dvecn,/dp". The 
Fisher information is 


OvecV} dvecVi, Ne 
a mobs V week | mobs xXIT vo xi 37 
tag mt Py ( OyT ( or ® S| OyT x mobs ' mobs~* mobs ( ) 
—_pobs OvecV1 
or Saas = Dea Da (Gres Arvinte ® Baa Angina) Er and CGP)” Va Va) 
Vipate =y% 1 oat ae 1 pay EX (O71 A 1 y q/i'mobs Vote = 1 Arcinisis) & 
(e"., An gitmossV: one 7 - 1 Argimobe) Ee. 


Likelihood 
The observed log-likelihood 1(4|doss) & — Sim Neu (log|V,b one] + dennsVircbetnode) Measures con- 
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vergence to ML for doys = (dicbs, doops, °° *, d Waste): Recursive components are 


Nis 


log| Vi obs| = a log| To cim!| 20 log|Agm| a log|Qqm| 2% log|V2553I (38) 
i=1 
= -1)T +1 t 
eV asa = 7 Nee Vas ee ot [domobs _ Gigi etre.) (39) 
Noam 


x ‘ap Te tisin (Carnie _ Vi ne ear re 
i=1 
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